Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

[Tibshirani and Hastie, 2007]. The outliers are defined by $\{y \in y_i : y > q_{75}(\mathbf{x}_i, \mathbf{y}_i) + \mathrm{IQR}(\mathbf{x}_i, \mathbf{y}_i)\}$. The OS t statistic is defined below, where [...] number of the case expressions. The OS p value is also calculated [...] permutation approach.

$$\hat{t}_{\mathrm{OS}} = \sum_{i=1}^{m} (y_i - \lambda) \tag{6.11}$$
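The outlier rule and the sum in Eq. (6.11) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the text: the function names are hypothetical, quartiles follow the "inclusive" convention of Python's `statistics.quantiles` (other quartile conventions give slightly different fences), and `lam` is passed in by the caller because the definition of $\lambda$ is not visible on this page. The sketch implements the sum exactly as printed, with no scaling term.

```python
from statistics import quantiles


def upper_fence(values):
    """Return q75 + IQR for a list of numbers (inclusive quartiles, an assumption)."""
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    return q75 + (q75 - q25)


def os_outliers(x, y):
    """Case expressions in y that exceed the fence computed from pooled x and y."""
    fence = upper_fence(list(x) + list(y))
    return [v for v in y if v > fence]


def os_statistic(x, y, lam):
    """Sum of (y_i - lam) over the outlying case expressions, as in Eq. (6.11)."""
    return sum(v - lam for v in os_outliers(x, y))
```

For example, with control expressions `[1, 2, 3, 4, 5]` and case expressions `[2, 7, 20]`, the pooled fence is 9, so only 20 is treated as an outlier.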

[...]erator was further revised in the outlier robust t statistic algorithm (ORT). In ORT, the range of outlier discovery was enlarged using the [...]rter percentile [Wu, 2007]. OS and ORT employ a similar strategy. OS makes a division between the outlier and non-outlier expressions while ORT divides the control expressions using the pooled expressions. ORT uses the same statistic but defines the outliers in relation [...]e expressions only:

$$\{y \in y_i : y > q_{75}(x_i) + \mathrm{IQR}(x_i)\} \tag{6.12}$$
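As a rough illustration of Eq. (6.12), the following Python sketch selects outliers among the case expressions using a fence computed from the $x$ expressions alone. The function name is hypothetical, and the "inclusive" quartile convention of `statistics.quantiles` is an assumption, since the text does not say how the quartiles are estimated.

```python
from statistics import quantiles


def ort_outliers(x, y):
    """Values in y above q75(x) + IQR(x), following Eq. (6.12).

    Quartiles use the inclusive method, which is an assumed convention.
    """
    q25, _, q75 = quantiles(x, n=4, method="inclusive")
    fence = q75 + (q75 - q25)
    return [v for v in y if v > fence]
```

With `x = [1, 2, 3, 4, 5]` and `y = [2, 7, 20]`, the fence is 6, so both 7 and 20 are selected. Because the fence no longer depends on the case expressions, extreme case values cannot raise it.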

MOST

The maximum ordered subset t statistic algorithm (MOST) employs a [...] method [Lian, 2008]. For both the control expressions and the case expressions, two median values are calculated. They are $\mu_x = \mathrm{median}(\mathbf{x})$ and $\mu_y = \mathrm{median}(\mathbf{y})$. Based on these two median values, the median of the absolute differences between expressions and expression medians is calculated using the following equation,

$$\omega = 1.4826 \times \mathrm{median}\{|\mathbf{x} - \mu_x|, |\mathbf{y} - \mu_y|\} \tag{6.13}$$
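Eq. (6.13) can be checked numerically with a short Python sketch. The function name is hypothetical. The sketch assumes that the braces in the equation mean pooling the absolute deviations of both groups into a single collection before taking the median.

```python
from statistics import median


def pooled_mad_scale(x, y):
    """Scale estimate omega from Eq. (6.13).

    Each group is centred on its own median, the absolute deviations are
    pooled (an assumed reading of the braces), and their median is
    multiplied by 1.4826.
    """
    mu_x = median(x)
    mu_y = median(y)
    deviations = [abs(v - mu_x) for v in x] + [abs(v - mu_y) for v in y]
    return 1.4826 * median(deviations)
```

The constant 1.4826 is the usual factor that makes the median absolute deviation a consistent estimate of the standard deviation for Gaussian data.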

A standard Gaussian distribution is generated as a benchmark distribution, which has a zero mean and a unit standard deviation. The [...] such a standard Gaussian distribution is denoted by $\vartheta$. Suppose k